• CSV is common in data science because it is human-readable, less verbose than formats like JSON and XML, and easy to produce from almost any tool. However, the format is underspecified, and CSV files compress poorly and parse slowly. Many file formats are better suited to tabular data. This post looks at one of them, Apache Parquet, and shows with examples how it beats CSV on both compression and performance.

• The typical business analyst query significantly underutilizes available compute. Large language models consume far more compute and may usher in another expansionary era of bloat. Most companies will not have the budget to run large language models over big data, so most business use cases will inevitably operate on fewer than a million rows. This shift to small data means data egress will no longer be a moat, opening an opportunity for new GPU cloud competitors to emerge.
